# High-resolution image processing
## Kimi-VL-A3B-Thinking-2506
moonshotai · MIT · Image-to-Text · Transformers · 515 downloads · 67 likes

Kimi-VL-A3B-Thinking-2506 is an upgraded version of Kimi-VL-A3B-Thinking, with significant improvements in multimodal reasoning, visual perception and understanding, and video scene processing. It supports higher-resolution images and reasons more effectively while consuming fewer tokens.
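Kimi-VL ships its own modeling code, so loading it through Transformers needs `trust_remote_code=True`. A minimal sketch assuming the hub id `moonshotai/Kimi-VL-A3B-Thinking-2506` and a chat-template interface like the one on the model card; exact message fields may differ:

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "moonshotai/Kimi-VL-A3B-Thinking-2506"
# The checkpoint ships custom modeling code, hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("demo.png")  # placeholder input image
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "demo.png"},
        {"type": "text", "text": "Describe this image step by step."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```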
## style_250412.vit_base_patch16_siglip_384.v2_webli
p1atdev · Image Classification · Transformers · 66 downloads · 0 likes

A vision model based on the Vision Transformer architecture and trained with SigLIP (Sigmoid Loss for Language-Image Pretraining), suitable for image understanding tasks.
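If the checkpoint is stored in timm format, it can be loaded straight from the Hub with the `hf_hub:` prefix. A classification sketch; the repo id `p1atdev/style_250412.vit_base_patch16_siglip_384.v2_webli` is inferred from the listing and may differ:

```python
import timm
import torch
from PIL import Image

# Repo id inferred from the listing; adjust if it differs.
model = timm.create_model(
    "hf_hub:p1atdev/style_250412.vit_base_patch16_siglip_384.v2_webli",
    pretrained=True,
)
model.eval()

# Build the preprocessing pipeline the checkpoint expects (384x384 here).
config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

image = Image.open("artwork.png").convert("RGB")
with torch.no_grad():
    logits = model(transform(image).unsqueeze(0))
probs = logits.softmax(dim=-1)
print(probs.topk(3))  # top predicted classes and their probabilities
```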
## eva02_large_patch14_clip_224.merged2b
timm · MIT · Image Classification · 165 downloads · 0 likes

EVA-CLIP is a vision-language model distributed as OpenCLIP- and timm-compatible weights, supporting tasks such as zero-shot image classification.
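Since the entry advertises zero-shot classification, OpenCLIP is the natural interface. A sketch assuming the full timm repo id carries the usual training suffix (`merged2b_s4b_b131k`), which the listing abbreviates:

```python
import torch
import open_clip
from PIL import Image

# Full repo id assumed; the listing abbreviates it to "merged2b".
repo = "hf-hub:timm/eva02_large_patch14_clip_224.merged2b_s4b_b131k"
model, preprocess = open_clip.create_model_from_pretrained(repo)
tokenizer = open_clip.get_tokenizer(repo)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
text = tokenizer(["a photo of a cat", "a photo of a dog", "a satellite image"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize, then score each caption against the image.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)
```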
## vit_huge_patch14_clip_378.dfn5b
timm · Other · Image Classification · Transformers · 461 downloads · 0 likes

The visual encoder component of DFN5B-CLIP, based on the ViT-Huge architecture and trained on 378x378-pixel images for use in a CLIP model.
## vit_so400m_patch14_siglip_gap_896.pali2_10b_pt
timm · Apache-2.0 · Image Feature Extraction · Transformers · 57 downloads · 1 like

A vision model based on the SigLIP image encoder with global average pooling, part of the PaliGemma2 model.
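As a pure encoder, this checkpoint is mainly useful for pooled image embeddings. A standard timm feature-extraction sketch; the timm model name is inferred from the listing:

```python
import timm
import torch
from PIL import Image

# num_classes=0 removes any head, returning the global-average-pooled embedding.
model = timm.create_model(
    "vit_so400m_patch14_siglip_gap_896.pali2_10b_pt",
    pretrained=True,
    num_classes=0,
)
model.eval()

config = timm.data.resolve_model_data_config(model)  # expects 896x896 inputs
transform = timm.data.create_transform(**config, is_training=False)

image = Image.open("scene.jpg").convert("RGB")
with torch.no_grad():
    embedding = model(transform(image).unsqueeze(0))
print(embedding.shape)  # (1, embed_dim)
```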
## clip-finetuned-csu-p14-336-e3l57-l
kevinoli · Zero-Shot Image Classification · Transformers · 31 downloads · 0 likes

A fine-tuned version of openai/clip-vit-large-patch14-336, used primarily for image-text matching tasks.
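Image-text matching with a CLIP checkpoint follows the standard Transformers pattern. The sketch below uses the base `openai/clip-vit-large-patch14-336`; the fine-tuned repo id (presumably `kevinoli/clip-finetuned-csu-p14-336-e3l57-l`) would be swapped in the same way:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Base checkpoint shown; substitute the fine-tuned repo id here.
model_id = "openai/clip-vit-large-patch14-336"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("photo.jpg")
captions = ["a dog playing in the park", "a city skyline at night"]
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
# logits_per_image: similarity of the image to each candidate caption.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```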
## idefics2-8b-chatty
HuggingFaceM4 · Apache-2.0 · Image-to-Text · Transformers · English · 617 downloads · 94 likes

Idefics2 is an open multimodal model that accepts arbitrary sequences of images and text as input and generates text output. It can answer questions about images, describe visual content, create stories grounded in multiple images, or function purely as a language model.
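Idefics2 is natively supported in Transformers (v4.40+) through `AutoModelForVision2Seq`. A minimal single-image chat sketch:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceM4/idefics2-8b-chatty"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# The {"type": "image"} slot is filled by the image passed to the processor.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is happening in this image?"},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
image = Image.open("street.jpg")
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```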
## InternViT-6B-448px-V1-5
OpenGVLab · MIT · Image Feature Extraction · Transformers · 155 downloads · 79 likes

InternViT-6B-448px-V1-5 is a vision foundation model fine-tuned from InternViT-6B-448px-V1-2, featuring strong robustness, OCR capability, and high-resolution processing.
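InternViT is an image encoder rather than a generative model: it maps pixels to patch-level and pooled features. A loading sketch following the remote-code pattern typical of OpenGVLab model cards (a GPU is assumed for a 6B encoder):

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

model_id = "OpenGVLab/InternViT-6B-448px-V1-5"
# The checkpoint ships its own modeling code, hence trust_remote_code=True.
model = AutoModel.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()
image_processor = CLIPImageProcessor.from_pretrained(model_id)

image = Image.open("document.png").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

with torch.no_grad():
    outputs = model(pixel_values)
print(outputs.last_hidden_state.shape)  # patch-token features
```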
## idefics2-8b-base
HuggingFaceM4 · Apache-2.0 · Image-to-Text · Transformers · English · 1,409 downloads · 28 likes

Idefics2 is an open multimodal model developed by Hugging Face that processes image and text inputs to generate text outputs, excelling at OCR, document understanding, and visual reasoning.
## ChatTruth-7B
mingdali · Image-to-Text · Transformers · Multilingual · 73 downloads · 13 likes

ChatTruth-7B is a multilingual vision-language model built on the Qwen-VL architecture, enhanced with large-resolution image processing and a restoration module that reduces computational overhead.
## vit_small_patch14_dinov2.lvd142m
timm · Apache-2.0 · Image Classification · Transformers · 35.85k downloads · 3 likes

A Vision Transformer (ViT) image feature model pre-trained with the self-supervised DINOv2 method on the LVD-142M dataset.
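Despite the classification tag, DINOv2 checkpoints are most often used as frozen feature extractors. A standard timm sketch returning both pooled and patch-level features:

```python
import timm
import torch
from PIL import Image

# num_classes=0 drops the classifier and returns pooled DINOv2 features.
model = timm.create_model("vit_small_patch14_dinov2.lvd142m", pretrained=True, num_classes=0)
model.eval()

config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

image = Image.open("sample.jpg").convert("RGB")
batch = transform(image).unsqueeze(0)
with torch.no_grad():
    pooled = model(batch)                   # pooled embedding (384-dim for ViT-S)
    tokens = model.forward_features(batch)  # unpooled patch tokens
print(pooled.shape, tokens.shape)
```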
## vit-base-patch16-224-in21k-eurosat
ingeniou · Apache-2.0 · Image Classification · Transformers · 25 downloads · 0 likes

A model based on Google's Vision Transformer (ViT) architecture, pre-trained on ImageNet-21k and fine-tuned on the EuroSAT dataset for remote-sensing image classification.
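For a fine-tuned ViT classifier, the Transformers pipeline API is the shortest path; the repo id below is inferred from the listing and may differ:

```python
from transformers import pipeline

# Repo id inferred from the listing; adjust if it differs.
classifier = pipeline(
    "image-classification",
    model="ingeniou/vit-base-patch16-224-in21k-eurosat",
)
# EuroSAT labels are land-use categories such as Forest, River, Industrial.
print(classifier("satellite_tile.png", top_k=3))
```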
## segformer-b5-finetuned-cityscapes-1024-1024
nvidia · Other · Image Segmentation · Transformers · 31.18k downloads · 24 likes

A SegFormer semantic segmentation model fine-tuned on the Cityscapes dataset at 1024x1024 resolution, pairing a hierarchical Transformer encoder with a lightweight all-MLP decode head.
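SegFormer is natively supported in Transformers. A minimal sketch producing per-pixel Cityscapes predictions; note that the logits come out at 1/4 of the input resolution and are usually upsampled:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, SegformerForSemanticSegmentation

model_id = "nvidia/segformer-b5-finetuned-cityscapes-1024-1024"
processor = AutoImageProcessor.from_pretrained(model_id)
model = SegformerForSemanticSegmentation.from_pretrained(model_id)

image = Image.open("street_scene.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, 19 Cityscapes classes, H/4, W/4)

# Upsample to the original resolution and take the per-pixel argmax.
upsampled = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
)
segmentation = upsampled.argmax(dim=1)[0]
print(segmentation.shape)
```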